-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[Feature](iceberg) Implement rewrite_manifests procedure for Iceberg #59487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
|
run performance |
TPC-H: Total hot run time: 35722 ms |
TPC-DS: Total hot run time: 175094 ms |
ClickBench: Total hot run time: 27.01 s |
|
@suxiaogang223 Could you please review this PR when you have time? |
Cool👍. Sorry for the late reply, I will review it as soon as possible. |
...-core/src/main/java/org/apache/doris/datasource/iceberg/rewrite/ManifestRewriteExecutor.java
Outdated
Show resolved
Hide resolved
...-core/src/main/java/org/apache/doris/datasource/iceberg/rewrite/ManifestRewriteExecutor.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/doris/datasource/iceberg/action/IcebergRewriteManifestsAction.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/doris/datasource/iceberg/action/IcebergRewriteManifestsAction.java
Outdated
Show resolved
Hide resolved
|
I've reviewed the current implementation against the official Apache Iceberg Spark procedures documentation, and I'd like to suggest some changes to better align with the upstream specification. 1. Parameter DesignSpark Official Implementation
Current Doris Implementation
Issue & RecommendationThe current implementation is over-engineered compared to the official specification. I believe we should simplify the implementation to only support the basic
Proposed ChangesI recommend removing the following parameters:
Recommended Parameter Set
Note: We could add 2. Return Value SchemaSpark Official Output
Current Doris Output
IssueThe second output column is inconsistent with the official specification. The Spark version returns Recommended Change// Current implementation
return Lists.newArrayList(
new Column("rewritten_manifests_count", Type.INT, false,
"Number of data manifests rewritten by this command"),
new Column("total_data_manifests_count", Type.INT, false, // ❌ Incorrect
"Total number of data manifests before rewrite")
);
// Recommended change - align with Spark spec
return Lists.newArrayList(
new Column("rewritten_manifests_count", Type.INT, false,
"Number of manifests which were re-written by this command"),
new Column("added_manifests_count", Type.INT, false, // ✅ Aligned with spec
"Number of new manifest files which were written by this command")
);Note: The 3. Summary of Recommended Changes
4. Benefits of Alignment✅ Compatibility - Users familiar with Spark/Iceberg will have a consistent experience across engines 5. Additional Considerations5.1 Engine-Specific OptimizationsParameters like 5.2 Backward CompatibilityIf there are concerns about removing existing parameters, we could:
5.3 Implementation ReferenceThe recommended approach is to use Iceberg's RewriteManifests rm = table.rewriteManifests();
if (specId != null) {
rm.specId(specId);
}
rm.commit();This is simpler and more aligned with the official specification than the current predicate-based approach. 6. Example Usage (After Changes)-- Basic usage - rewrite all manifests for current spec
CALL catalog_name.system.rewrite_manifests('db.table');
-- Or via ALTER TABLE (Doris syntax)
ALTER TABLE catalog_name.db.table EXECUTE rewrite_manifests();
-- Optional: rewrite manifests for a specific spec
ALTER TABLE catalog_name.db.table EXECUTE rewrite_manifests('spec_id' = '1'); |
|
@Zhou-lemon I think our actions need to be consistent with Spark's official Iceberg implementation. Of course, some Spark-related parameters may not need to be implemented. You can see https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_manifests |
.../src/main/java/org/apache/doris/datasource/iceberg/action/IcebergRewriteManifestsAction.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/doris/datasource/iceberg/action/IcebergRewriteManifestsAction.java
Show resolved
Hide resolved
|
run buildall |
|
run buildall |
…_rewrite_manifest
|
run buildall |
TPC-H: Total hot run time: 31572 ms |
TPC-DS: Total hot run time: 174172 ms |
ClickBench: Total hot run time: 27.24 s |
|
Looks like perfect :) |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
suxiaogang223
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by anyone and no changes requested. |
morningman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
What problem does this PR solve?
Issue Number: #58199
Problem Summary: This PR implements the rewrite_manifests procedure for Iceberg tables in Apache Doris. The feature allows users to optimize Iceberg table metadata by rewriting manifest files to improve query performance and reduce metadata overhead. This addresses the need for manifest file optimization in large Iceberg tables where numerous small manifest files can impact query planning performance.
Release note
Feature Implement rewrite_manifests procedure for Iceberg tables
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)